A preliminary visual investigation of the relationship between the performance of NBA lineups and the players within them.

Getting NBA Data

The pages at stats.nba.com are backed by a great set of JSON APIs, making it easy to work with their data. They have extensive stats for lineups, players, and a lot more.

Some Libraries We’ll Need

library(rjson)
library(dplyr)

Getting data from stats.nba.com into R

I used the rjson library to download the json and convert it into an R data frame. The following helper function, given a url, the number of columns, and a list of numeric columns, will fetch the json, convert the data into a matrix, then convert it into a data frame.

df_from_url = function(url, ncol, number_columns) {
    json = fromJSON(file = url, method = "C")
    df = data.frame(matrix(unlist(json$resultSets[[1]][[3]]), ncol = ncol, byrow = TRUE), 
        stringsAsFactors = FALSE)
    colnames(df) = json$resultSets[[1]][[2]]
    df[, number_columns] = apply(df[, number_columns], 2, function(x) as.numeric(as.character(x)))
    return(df)
}
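
To make the indexing in that helper concrete, here is a minimal offline sketch. It assumes each result set is a list holding a name, the column headers, and the rows, in that order. The `fake_json` object and its values are invented for illustration and are not real API output.

```r
# Hypothetical stand-in for a parsed response (all values invented).
fake_json = list(resultSets = list(list(
    name = "Example",
    headers = c("TEAM_ID", "TEAM_NAME", "W"),
    rowSet = list(list(1, "Team A", 50), list(2, "Team B", 32))
)))

result_set = fake_json$resultSets[[1]]
# Flatten the rows into one vector, then rebuild them row by row.
df = data.frame(matrix(unlist(result_set[[3]]), ncol = 3, byrow = TRUE),
    stringsAsFactors = FALSE)
colnames(df) = result_set[[2]]
# unlist() coerces mixed values to character, so numeric columns are converted back.
num_cols = c(1, 3)
df[, num_cols] = apply(df[, num_cols], 2, function(x) as.numeric(as.character(x)))
str(df)
```

One caveat with this approach: `unlist` drops `NULL` entries, so if the parser returns missing values as `NULL`, values could shift between columns.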

Some Setup

Years

The APIs take seasons as strings, so we need to convert from the years in question to the formatted season strings. 2007 is the first year for which they have lineup data.

years = sapply(2007:2015, function(year) sprintf("%4d-%02d", year, (year + 1)%%100))
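
As a quick check of that format string, the sketch below wraps it in a hypothetical `season_string` helper (not part of the original code) and includes a century rollover.

```r
# Hypothetical helper: format a starting year as an API season string.
season_string = function(year) sprintf("%4d-%02d", year, (year + 1) %% 100)

seasons = season_string(c(1999L, 2007L, 2015L))
print(seasons)
# "1999-00" "2007-08" "2015-16"
```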

Team Ids

This is absolute overkill, because none of the team ids have changed over the year range we’re interested in, but I didn’t know that for sure until after I’d run it.

team_fmt = "http://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Per100Plays&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=%s&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision="
team_urls = sapply(years, function(year) sprintf(team_fmt, year))
team_dfs = sapply(team_urls, function(url) df_from_url(url, 30, c(1, 3:29)))
team_ids = Reduce(union, team_dfs[1, ])

Player Data

First, we download the player data. We’ll loop over the years and NBA stat collections, and then combine all the data together with merge and rbind into one big data frame. We request stats per 100 plays, but the API seems to intelligently determine when to respect that.

columns = c(35, 32, 24, 27, 30)
stat_types = c("Base", "Advanced", "Misc", "Scoring", "Usage")
player_fmt = "http://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=%s&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Per100Plays&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=%s&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight="
players = NULL
for (year in years) {
    season_df = NULL
    for (i in 1:length(stat_types)) {
        stat_type = stat_types[i]
        c = columns[i]
        numeric_columns = c(1, 3, 5:(c - 1))
        url = sprintf(player_fmt, stat_type, year)
        df = df_from_url(url, c, numeric_columns)
        if (is.null(season_df)) {
            season_df = df
        } else {
            season_df = merge(season_df, df, by = 1, all.x = TRUE, suffixes = c("", 
                sprintf("_%s", stat_type)))
        }
    }
    season_df$SEASON = factor(year)
    if (is.null(players)) {
        players = season_df
    } else {
        players = rbind(players, season_df)
    }
}
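
The `suffixes` argument is what keeps the repeated columns apart. A minimal sketch with two toy frames (values invented) shows how a column shared by two stat groups ends up with the stat type appended:

```r
# Toy stand-ins for two stat groups that both report MIN.
base_df = data.frame(PLAYER_ID = c(1, 2), MIN = c(30, 20), PTS = c(25, 10))
adv_df = data.frame(PLAYER_ID = c(1, 2), MIN = c(30.1, 19.8), NET_RATING = c(5, -2))

# Join on the first column; clashing columns from the right side get the suffix.
merged = merge(base_df, adv_df, by = 1, all.x = TRUE, suffixes = c("", "_Advanced"))
print(colnames(merged))
# "PLAYER_ID" "MIN" "PTS" "MIN_Advanced" "NET_RATING"
```

The left-hand copy keeps its plain name, so only the first stat group's version of a shared column stays unsuffixed.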

Lineup Data

Fetching lineup data is similar; however, the API is limited to 250 entries per response, so we loop through the years, teams, and stat groups. This results in well over a thousand API calls, and can take a very long time to run.

stat_types = c("Base", "Advanced", "Four+Factors", "Misc", "Scoring", "Opponent")
columns = c(31, 24, 18, 18, 25, 31)
lineup_fmt = "http://stats.nba.com/stats/leaguedashlineups?Conference=&DateFrom=&DateTo=&Division=&GameID=&GameSegment=&GroupQuantity=5&LastNGames=0&LeagueID=00&Location=&MeasureType=%s&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Per100Plays&Period=0&PlusMinus=N&Rank=N&Season=%s&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&TeamID=%d&VsConference=&VsDivision="
lineups = NULL
for (year in years) {
    for (team in team_ids) {
        season_df = NULL
        for (i in 1:length(stat_types)) {
            stat_type = stat_types[i]
            c = columns[i]
            numeric_columns = c(4, 6:c)
            url = sprintf(lineup_fmt, stat_type, year, team)
            df = df_from_url(url, c, numeric_columns)
            if (is.null(season_df)) {
                season_df = df
            } else {
                season_df = merge(season_df, df, by = 2, all.x = TRUE, suffixes = c("", 
                  sprintf("_%s", stat_type)))
            }
        }
        season_df$SEASON = factor(year)
        if (is.null(lineups)) {
            lineups = season_df
        } else {
            lineups = rbind(lineups, season_df)
        }
    }
}
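
Because a run like this is slow, it may be worth caching each response on disk so that an interrupted run can resume. The sketch below is one possible approach, not what was used here: `cached_fetch` is a hypothetical helper, and the fetch step is replaced by a stand-in function so the example runs offline.

```r
# Hypothetical helper: return a cached result if present, otherwise fetch and store it.
cached_fetch = function(key, fetch, cache_dir) {
    dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
    path = file.path(cache_dir, paste0(key, ".rds"))
    if (file.exists(path)) {
        return(readRDS(path))
    }
    result = fetch()
    saveRDS(result, path)
    result
}

# Stand-in fetch function that counts how often it actually runs.
calls = 0
fake_fetch = function() {
    calls <<- calls + 1
    data.frame(GROUP_ID = "1 - 2 - 3 - 4 - 5", MIN = 12)
}

demo_dir = file.path(tempdir(), "nba_cache_demo")
unlink(demo_dir, recursive = TRUE)  # start from an empty cache
first = cached_fetch("lineups_Base_2007-08_1", fake_fetch, demo_dir)
second = cached_fetch("lineups_Base_2007-08_1", fake_fetch, demo_dir)
print(calls)
```

In the real loop, the key could be built from the stat type, season, and team id, and the fetch function would wrap the call to `df_from_url`.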

Player Cleanup

The API treats minutes differently depending upon the stat group requested. Advanced appears to return minutes per game while Usage returns total minutes. We rename these appropriately.

players = mutate(players, MIN_TOTAL = MIN_Usage, MIN_GAME = MIN_Advanced)
players = select(players, -X, -matches("CFID|CFPARAMS|_[A-Z][a-z]", FALSE))
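
The pattern in that `select` call relies on case-sensitive matching: an underscore followed by an uppercase and then a lowercase letter only occurs in the stat-type suffixes added by the merge. A base R sketch on a few assumed column names illustrates which ones would be dropped:

```r
# Assumed examples of column names after merging the stat groups.
cols = c("PLAYER_ID", "MIN", "MIN_Advanced", "USG_PCT", "CFID", "FG3_PCT_Scoring")

# grepl() is case-sensitive by default, like matches(..., ignore.case = FALSE).
drop = grepl("CFID|CFPARAMS|_[A-Z][a-z]", cols)
kept = cols[!drop]
print(kept)
```

All-uppercase names such as `USG_PCT` survive because `_P` is followed by another uppercase letter.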

Lineup Cleanup

For lineups, however, Advanced seems to return the total minutes. Once again, we rename the column.

lineups = tbl_df(lineups)
lineups = mutate(lineups, MIN_TOTAL = MIN_Advanced)
lineups = select(lineups, -X, -matches("GROUP_SET|CFID|CFPARAMS|_[A-Z][a-z]", FALSE))
lineups = Filter(function(x) !all(is.na(x)), lineups)

Identifying players

To match the lineup data to the player data, we need to identify the players in each lineup. Parsing the GROUP_ID allows us to do that.

lineups$PLAYERS = t(sapply(lineups$GROUP_ID, function(x) {
    as.integer(unlist(strsplit(as.character(x), split = " - ")))
}))
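
As a small illustration, here is the same parsing applied to two of the GROUP_ID values that appear in the sample table later in this post:

```r
group_ids = c("2551 - 101122 - 101127 - 201177 - 2211",
    "201589 - 200768 - 201564 - 2545 - 2624")

# Split each id on " - " and stack the results so each lineup is one row of five player ids.
player_matrix = t(sapply(group_ids, function(x) {
    as.integer(unlist(strsplit(as.character(x), split = " - ")))
}, USE.NAMES = FALSE))
print(dim(player_matrix))
print(player_matrix[1, ])
```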

Calculating Player Averages for a Lineup

For a given lineup, we find the stats for the players in the lineup and average them. For some stats, this makes sense. For others, it won’t. In many cases, we’ll be more interested in the sum, but we can get that later by multiplying by 5.

season_col = grep("SEASON", colnames(lineups))
player_col = grep("PLAYERS", colnames(lineups))
numeric_player_columns = as.vector(which(sapply(players, is.numeric)))
lineup_averages = data.frame(t(apply(lineups, 1, function(x) {
    srows = players$SEASON == x[season_col]
    # `:` binds tighter than `+`, so the range end is parenthesized to cover all five players.
    prows = players$PLAYER_ID %in% as.numeric(x[player_col:(player_col + 4)])
    sapply(players[srows & prows, numeric_player_columns], mean)
})))
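
Here is a toy version of that lookup for a single lineup row, with invented player stats. It also shows why the index range needs parentheses: in R, `:` is evaluated before `+`.

```r
# Invented full-season stats for six players.
players_toy = data.frame(
    PLAYER_ID = 1:6,
    SEASON = factor(rep("2008-09", 6)),
    PTS = c(10, 20, 30, 40, 50, 60)
)
# One lineup row flattened to a character vector, as apply() would pass it.
x = c("2008-09", "1", "2", "3", "4", "5")
season_col = 1
player_col = 2

wrong_index = player_col:player_col + 4    # (2:2) + 4, i.e. just 6
right_index = player_col:(player_col + 4)  # 2 3 4 5 6

srows = players_toy$SEASON == x[season_col]
prows = players_toy$PLAYER_ID %in% as.numeric(x[right_index])
lineup_mean = sapply(players_toy[srows & prows, "PTS", drop = FALSE], mean)
print(lineup_mean)
```

With the five matching players scoring 10 through 50, the lineup's player average for PTS comes out at 30.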

Calculating Usage-Weighted Averages for a Lineup

We do something similar to calculate the usage-weighted averages for a lineup. This won’t make any sense for most stats, but for many offensive stats, it should provide a more reasonable estimate than a straight average.

usg_weighted = data.frame(t(apply(lineups, 1, function(x) {
    srows = players$SEASON == x[season_col]
    # Parenthesize the range end so all five player columns are selected.
    prows = players$PLAYER_ID %in% as.numeric(x[player_col:(player_col + 4)])
    stats = players[srows & prows, numeric_player_columns]
    tot_usg = sum(stats$USG_PCT_PCT)
    sapply(stats, function(y) sum(y * stats$USG_PCT_PCT)/tot_usg)
})))
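
To see how the weighting differs from a straight average, consider a toy lineup with invented numbers (the column names here are simplified stand-ins):

```r
# Invented stats for five players; usage sums to 1 for simplicity.
stats_toy = data.frame(
    PTS = c(30, 20, 10, 10, 5),
    USG_PCT = c(0.30, 0.25, 0.20, 0.15, 0.10)
)

tot_usg = sum(stats_toy$USG_PCT)
usage_weighted_pts = sum(stats_toy$PTS * stats_toy$USG_PCT) / tot_usg
plain_mean_pts = mean(stats_toy$PTS)
print(c(weighted = usage_weighted_pts, plain = plain_mean_pts))
```

The weighted figure (18) sits above the plain mean (15) because the highest-usage player is also the highest scorer. Base R's `weighted.mean(stats_toy$PTS, stats_toy$USG_PCT)` gives the same result.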

Putting it all together

Finally, we add suffixes to our lineups, averages, and usage-weighted averages, and merge them all together into a gigantic data frame.

colnames(lineups) = paste(names(lineups), "lineup", sep = ".")
colnames(lineup_averages) = paste(names(lineup_averages), "player", sep = ".")
colnames(usg_weighted) = paste(names(usg_weighted), "usage", sep = ".")
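# merge() resets row names and adds a Row.names key column, so the second merge joins on that key.
nba = merge(lineups, lineup_averages, by = 0)
nba = merge(nba, usg_weighted, by.x = "Row.names", by.y = 0)
nba$Row.names = NULL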
dim(nba)
## [1] 66784   259
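
One detail worth knowing when chaining merges on row names: `merge` with `by = 0` returns rows sorted by row name as text and resets the row names, storing the original names in a `Row.names` column. A minimal sketch with toy frames (all values invented) shows the effect and one way to keep a second merge aligned by keying on that column.

```r
# Three toy frames whose rows correspond by position.
a = data.frame(x = 1:11)
b = data.frame(y = (1:11) * 10)
d = data.frame(z = (1:11) * 100)

ab = merge(a, b, by = 0)
# Rows are now ordered "1", "10", "11", "2", ... and row names restart at 1.
print(head(ab, 3))

# Join the third frame on the stored key rather than on the new row names.
abd = merge(ab, d, by.x = "Row.names", by.y = 0)
print(all(abd$y == abd$x * 10 & abd$z == abd$x * 100))
```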

Net Rating

In the end, what we really care about is the Net Rating of lineups. Will our lineup score more points than its opponents? It’s important to note that stats.nba.com’s formulation of Net Rating (and hence Offensive Rating and Defensive Rating) for players is essentially scaled +/-, and is distinct from the Dean Oliver version of these stats.

Net Rating is point differential per 100 possessions. +/- in our data is per 100 plays. The deviations from a straight line should be due to the difference between possessions and plays (which will be a function of things like offensive rebounding).

Lineups with ‘better’ players produce better results. We’ve solved basketball! More seriously, while this suggests that +/- may have some predictive power, more investigation is needed to determine whether this is really predictive on the basis of individuals, or whether this is just a case of the known problems with +/- data pushing both variables in the same direction.

Data Issues?

Can we really trust this data?

For lineups with very small numbers of minutes, can we trust their values? One issue in our data is that we don’t have fractional minutes. Cutting on minutes suggests that while we have the same general trend for infrequently used lineups, the data seems a bit off from what we’d expect.

Drilling down into the one to ten minute range, it looks like we have reason to be skeptical of lineups with less than 5 minutes of play.

Furthermore, we should be concerned about lineups whose players have logged few minutes outside that lineup, since their full-season stats would then largely restate the lineup’s own results.

We’ll limit ourselves to lineups that account for no more than 10% of the average total minutes of the players in the lineup.

Final Cleanup

We’ll also remove lineups with crazy values for either the player or lineup NET_RATING.

mins_ok = nba$MIN_TOTAL.lineup > 4
ratio_ok = nba$MIN_TOTAL.lineup/nba$MIN_TOTAL.player <= 0.1
lineup_ok = nba$NET_RATING.lineup > -50 & nba$NET_RATING.lineup < 50
player_ok = nba$NET_RATING.player > -10 & nba$NET_RATING.player < 10
nba = nba[mins_ok & ratio_ok & lineup_ok & player_ok, ]
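
To illustrate how these four filters combine, here is the same logic applied to a toy data frame with invented values, where each dropped row fails exactly one check:

```r
nba_toy = data.frame(
    MIN_TOTAL.lineup = c(3, 50, 200, 40),
    MIN_TOTAL.player = c(1500, 1800, 1000, 1600),
    NET_RATING.lineup = c(10, 5, 8, 75),
    NET_RATING.player = c(2, 1, 3, 0)
)

mins_ok = nba_toy$MIN_TOTAL.lineup > 4                                 # row 1 fails
ratio_ok = nba_toy$MIN_TOTAL.lineup / nba_toy$MIN_TOTAL.player <= 0.1  # row 3 fails
lineup_ok = nba_toy$NET_RATING.lineup > -50 & nba_toy$NET_RATING.lineup < 50  # row 4 fails
player_ok = nba_toy$NET_RATING.player > -10 & nba_toy$NET_RATING.player < 10
kept_rows = which(mins_ok & ratio_ok & lineup_ok & player_ok)
print(kept_rows)
```

Only the second row passes every check.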

I’m not sure any of this would stand up to statistical scrutiny, but it should be okay for drawing pretty pictures.

Now What?

Here’s a very small subset of our data, showing the 3 types of data in our data frame.

knitr::kable(head(select(nba, GROUP_ID.lineup, SEASON.lineup, matches("^PTS\\."))))
|GROUP_ID.lineup                         |SEASON.lineup | PTS.lineup| PTS.player| PTS.usage|
|:---------------------------------------|:-------------|----------:|----------:|---------:|
|2551 - 101122 - 101127 - 201177 - 2211  |2008-09       |      106.1|       18.3|       4.5|
|201589 - 200768 - 201564 - 2545 - 2624  |2008-09       |       90.3|       10.6|      16.7|
|201147 - 2744 - 201567 - 201196 - 2863  |2009-10       |       97.6|       12.4|      14.9|
|201147 - 2744 - 2545 - 201196 - 2863    |2009-10       |       85.8|       12.4|      14.9|
|201567 - 2545 - 2562 - 201196 - 2863    |2009-10       |      113.4|       12.4|      17.3|
|201605 - 2562 - 200762 - 201196 - 2863  |2009-10       |       89.2|       12.4|      17.1|

  • .lineup indicates stats for a given lineup.
  • .player indicates the average (full-season) stats for the players in a lineup.
  • .usage indicates the usage-weighted (full-season) stats for the players in a lineup.

Efficiency vs. Usage Tradeoff

One of the fundamental questions within basketball analytics is the value of “shot creation”. Is “shot creation” a valuable skill, or should it be thought of more as “shot taking”? Should high-volume scorers with below average efficiency be seen as “stars” or are they actually hurting their teams by shooting more than they should?

We’ll look at the impact of Usage on both Offensive Rating (reminder: the NBA version, not the Dean Oliver version) and True Shooting Percentage. True Shooting is an overall measure of shot efficiency taking into account 2 pointers, 3 pointers, and free throws.
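
For reference, the commonly quoted formula for True Shooting Percentage is points divided by twice the sum of field goal attempts and 0.44 times free throw attempts. A minimal sketch, with a hypothetical `true_shooting` helper and an invented stat line:

```r
# Hypothetical helper implementing the commonly quoted TS% formula.
true_shooting = function(pts, fga, fta) pts / (2 * (fga + 0.44 * fta))

# Invented stat line: 25 points on 18 field goal attempts and 6 free throw attempts.
ts = true_shooting(pts = 25, fga = 18, fta = 6)
print(round(ts, 3))
```

The 0.44 factor approximates how many free throw attempts use up a possession; this sketch assumes the data source follows the same convention.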

We’ll limit ourselves to Usage between 10% and 30%, because anyone outside that range seems to be an outlier.

Players

For players, we see a wiggly, but generally upward trend between usage and our offensive metrics. Naively, this might seem to contradict the efficiency/usage tradeoff, as higher-usage players seem to have higher efficiency in general. However, this likely reflects the decisions of coaches. If someone can’t shoot, we tend to discourage them from doing so. It doesn’t tell us what would happen to a player’s efficiency if they increased or decreased their volume.

Lineups

Switching to lineup data, we see the same general trends. Lineups consisting of higher-usage players seem to perform better. Once again, this is likely because better shooters get to shoot more.

Usage-Weighting

When we look at our usage-weighted stats, we don’t see a strong relationship between them and our lineup stats, but it’s not all that clear how to interpret this.

Eli Witus

Eli Witus, who is now employed by an NBA team, came up with a better way of looking at this (and a better way of explaining it). He considers the null hypothesis that the lineup efficiency should be predicted by the usage-weighted efficiency. He uses Dean Oliver’s Offensive Rating calculation. We’ll instead use the NBA’s version of Offensive Rating as well as looking at True Shooting Percentage, which Witus didn’t investigate but suggested might produce interesting results.

These plots show the difference between our observed lineup efficiency and that predicted by the null hypothesis:

We see that lineups full of high usage players outperform expectations while lineups full of low usage players underperform expectations. Since a lineup must use 100% of its possessions, we can interpret this as lineup efficiency dropping in situations where players are forced to increase volume above their norms, while their efficiency increases in situations where they are able to reduce their volume.
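
As a sketch of that comparison, the quantity being plotted is just the observed lineup value minus the usage-weighted prediction. The column names below follow the suffix convention used earlier but are assumptions, and the numbers are invented.

```r
# Invented lineups: observed rating, usage-weighted prediction, and average player usage.
toy = data.frame(
    OFF_RATING.lineup = c(112, 104, 99),
    OFF_RATING.usage = c(108, 105, 103),
    USG_PCT.player = c(0.23, 0.20, 0.17)
)

# Positive residuals mean the lineup beat the null-hypothesis prediction.
toy$residual = toy$OFF_RATING.lineup - toy$OFF_RATING.usage
print(toy[order(toy$USG_PCT.player), c("USG_PCT.player", "residual")])
```

Under the null hypothesis the residuals would hover around zero regardless of average usage, so a slope in this relationship is what signals a usage effect.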

Turnovers? Assists?

Eli Witus posited that turnovers might also be a factor. The above plots show the difference between lineup and expected totals for turnovers and assists. There doesn’t seem to be much of an impact on assist numbers, but we see a jump in turnovers as players are forced to increase their usage.

Rebounding

Applying a similar approach to rebounding, it looks like there’s hardly any relationship between the rebounding numbers of a lineup and its constituent players. It suggests that no matter who you throw out there, they’re going to get about 50% of the available rebounds. Note that we’re no longer using usage-weighted numbers.

Breaking it down to offensive and defensive rebounding, over typical ranges of player rebounding, we don’t seem to see the players having an impact on the lineups.

Rebound value

If anything, lineups full of rebounders seem to perform poorly.

Offense vs. Defense

Both offensive and defensive ratings for players seem to impact lineup net rating in the direction we’d expect.

Diminishing Returns

For offense, we see diminishing returns. Beyond a certain point, more offensive talent doesn’t seem to improve the offensive performance of the lineup. Defensively, however, we don’t seem to see the same effect. At least over normal ranges, it seems you can never have too much defense.

3 Point Shooting

3 point shooting is a big deal these days. Let’s look at this…

Historic Volume

I’ve had to reduce the length of whiskers to one-half the inter-quartile range in order to make things fit. There is a lot of variability among lineups. Despite all that variability, we see an upward trend in recent years.
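
The plotting code isn't shown here, but the effect of shortening the whiskers can be sketched with base R's `boxplot.stats`, assuming the usual convention that whiskers reach the most extreme point within `coef` times the inter-quartile range of the box. The data below is invented.

```r
x = c(1:9, 12)

default_whiskers = boxplot.stats(x, coef = 1.5)
short_whiskers = boxplot.stats(x, coef = 0.5)

# The fifth element of $stats is the upper whisker end.
print(default_whiskers$stats[5])  # 12: the largest value is still inside the whisker
print(short_whiskers$stats[5])    # 9: the whisker stops earlier
print(short_whiskers$out)         # 12 is now treated as an outlier
```

The equivalent argument for a base `boxplot` call is `range = 0.5`.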

Is it helping?

Looking solely at lineup data, we see that taking and making more 3s seems to make your offense better.

Is this just the effect of time?

This trend holds up for every season.

More shots == more makes

You have to take them to make them.

We need more shooters!

So, we should just load up on shooters, right?

What’s going on?

More Shooting != More Shooting

It seems that the volume of 3 pointers made by a lineup doesn’t really depend upon the volume of 3 pointers taken by the players in that lineup. It may be that this is far more dependent on strategy, but this definitely needs more investigation.